arXiv: 1501.06478v2 [cs.LG] 2 Feb 2015 


Compressed Support Vector Machines 


Zhixiang (Eddie) Xu 

xuzx@cse.wustl.edu 


Jacob R. Gardner 

gardner.jake@gmail.com 


Stephen Tyree 

swtyree@wustl.edu 


Kilian Q. Weinberger 

kilian@wustl.edu 


Department of Computer Science & Engineering 
Washington University in St. Louis 
St. Louis, MO, USA 


Abstract 


Support vector machines (SVMs) can classify data sets along highly non-linear 
decision boundaries because of the kernel trick. This expressiveness comes at a 
price: at test time, the SVM classifier needs to compute the kernel inner 
product between a test sample and all support vectors. With large training data 
sets, the time required for this computation can be substantial. In this paper, 
we introduce a post-processing algorithm that compresses the learned SVM 
model by reducing and optimizing support vectors. We evaluate our algorithm 
on several medium-scale real-world data sets, demonstrating that it maintains 
high test accuracy while reducing the test-time evaluation cost by several orders 
of magnitude, in some cases from hours to seconds. It is fair to say that most of 
the work in this paper was previously invented by Burges and Schölkopf almost 
20 years ago. For most of the time during which we conducted this research, 
we were unaware of this prior work. However, in the past two decades, computing 
power has increased drastically, and we can therefore provide empirical insights 
that were not possible in their original paper. 

1 Introduction 

Support Vector Machines (SVMs) are arguably one of the great success stories of machine learning 
and have been used in many real-world applications, including email spam classification [?], face 
recognition [?] and gene selection [?]. In real-world applications, the evaluation cost (in terms 
of memory and CPU) at test time is of crucial importance. This is particularly prominent in 
settings with strong resource constraints (e.g. embedded devices, cell phones or tablets) or frequently 
repeated tasks (e.g. webmail spam classification, web-search ranking, face detection in uploaded 
images), which can be performed billions of times per day. Reducing the resource requirements 
to classify an input can reduce hardware costs, enable product improvements, and help curb power 
consumption. 

Test-time cost is determined mainly by two components: classifier evaluation cost and feature 
extraction cost. Reducing feature extraction cost has recently received a significant amount of 
attention [?][5][15][17][19][23][25][26]. These approaches reduce the test-time cost in scenarios where features 
are heterogeneous, extracted on-demand, and significantly more expensive to compute than the 
classifier evaluation. 

In this paper, we focus on the other common scenario where the classifier evaluation cost dominates 
the overall test-time cost. Specifically, we focus on kernel support vector machines (SVMs) [20]. 
Kernel computation can be expensive because it is linear in the number of support vectors and, in 
addition, often requires expensive exponentiation (e.g. for the radial basis function kernel). Previous 
work has reduced the classifier complexity by selecting few support vectors through budgeted 
training [6][24] or with heuristic selection prior to learning [13]. 

We describe an approach that does not select support vectors from the training set, but instead learns 
them to match a pre-defined SVM decision boundary. Given an existing SVM model with $m$ support 
vectors, it learns $r \ll m$ “artificial support vectors”, which are not originally part of the training set. 
The resulting model is a standard SVM classifier (and can thus be saved, for example, in a LibSVM [4] 
compatible file). Relative to the original model, it has comparable accuracy, but it is up to several 
orders of magnitude smaller and faster to evaluate. We refer to our algorithm as Compressed Vector 
Machine (CVM) and demonstrate on six real-world data sets of various size and complexity that 
it achieves unmatched accuracy vs. test-time cost trade-offs. 


2 Related Work 

Burges and Schölkopf invented Compressed Vector Machines long before us. We were not aware of 
their work until the final stages of paper writing. We still consider our perspective and additional 
experiments valuable and decided to post our results as a technical report. However, we do want to 
emphasize that all academic credit should go to them, as they were clearly ahead of us. 

Reducing test-time cost has recently attracted much attention. Much work [?][5][10][15][17][19][23][25] 
focuses on scenarios where features are extracted on-demand and the extraction cost dominates 
the overall test-time cost. The objective of these methods is to minimize the feature extraction cost. 

Model compression was pioneered by [1]. Our work was inspired by their vision; however, it differs 
substantially, as we do not focus on ensembles of classifiers and instead learn a model compressor 
explicitly for SVMs. More recently, [26] introduced an algorithm to reduce the test-time cost 
specifically for the SVM classifier. However, similar to the approaches mentioned above, they focus on 
learning a new representation consisting of cheap non-linear features for linear SVMs. 

[6] propose an algorithm to limit the memory usage for kernel-based online classification. Unlike 
our approach, their algorithm is not a post-processing procedure; instead, they modify the 
kernel function directly to limit the amount of memory the algorithm uses. Similar to [6], [?] also 
focuses on online kernel SVMs and primarily addresses the training-time complexity. 

Of particular relevance is [13], which specifically reduces the SVM evaluation cost by reducing the 
number of support vectors. Heuristics are used to select a small subset of support vectors, up to a 
given budget, during training, thus solving an approximate SVM optimization. In contrast, our 
method is a post-processing compression of the regular SVM. We begin from an exact SVM solution 
and compress the set of support vectors by choosing and optimizing over a small set of support 
vectors to approximate the optimal decision boundary. This post-processing optimization framework 
yields unmatched accuracy and cost performance. Similar approaches have successfully learned 
pseudo-inputs for compressed nearest neighbor classification sets [?] and sparse Gaussian process 
regression models [22]. 

3 Background 

Let the data consist of input vectors $\{\mathbf{x}_1, \dots, \mathbf{x}_n\} \in \mathcal{R}^d$ and corresponding labels 
$\{y_1, \dots, y_n\} \in \{-1, +1\}$. For simplicity we assume binary classification in the following section, 
but our algorithm is easily extended to multi-class settings using one-vs-one [?], one-vs-all [?], or DAG [?] 
approaches, and results are included for several multi-class datasets. 

Kernel support vector machines. SVMs are popular for their large margin enforcement, which 
leads to good generalization to unseen test data, and their formulation as a convex quadratic 
optimization problem, guaranteeing a globally optimal solution. Most importantly, the kernel trick [20] 
may be employed to learn highly non-linear decision boundaries for data sets that are not linearly 
separable. Specifically, the kernel trick maps each input $\mathbf{x}_i$ from the original feature space into a 
higher (possibly infinite) dimensional space $\phi(\mathbf{x}_i)$. 
SVMs learn a hyperplane in this higher dimensional space by maximizing the margin and 
penalizing training instances on the wrong side of the hyperplane. 


$$\min_{\mathbf{w},b}\ \mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^{n}\max\big(1 - y_i(\mathbf{w}^\top\phi(\mathbf{x}_i) + b),\ 0\big)^2, \qquad (1)$$

where $b$ is the bias, and $C$ trades off regularization/margin and training accuracy. Note that we 
use the quadratic hinge loss penalty and thus (1) is differentiable. The power of the kernel trick 
is that the higher dimensional space $\phi(\mathbf{x}_i)$ never needs to be expressed explicitly, because (1) can 
be formulated in terms of inner products between input vectors. Let a matrix $K$ denote these inner 
products, where $K_{ij} = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$, and $K$ is the training kernel matrix. The optimization in (1) 
can then be expressed in terms of the kernel matrix $K$ in the dual form: 
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \geq 0$, 
where $\alpha_i$ are the Lagrange multipliers. 

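The classification rule $f(\cdot)$ for a test input $\mathbf{x}_t$ can also be expressed by a testing kernel $K$ that consists 
of inner products between test inputs $\{\mathbf{x}_t\}$ and support vectors $S = \{\mathbf{x}_i \mid \alpha_i \neq 0\}$, 
$K_{it} = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_t)$, where 

$$f(\phi(\mathbf{x}_t)) = \sum_{i=1}^{n} \alpha_i y_i K_{it} + b. \qquad (3)$$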


Note that once the testing kernel $K$ is computed, generating the prediction is merely a linear 
combination, and thus the dominating cost is computing the testing kernel itself. 
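
The prediction rule (3) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `rbf_kernel` and `svm_predict`, the kernel parameterization `exp(-||a - b||^2 / sigma2)`, and the toy multipliers are assumptions made for illustration. It shows that only training points with non-zero multipliers contribute, so the work per test input is one kernel evaluation per support vector.

```python
import numpy as np

def rbf_kernel(A, B, sigma2=1.0):
    """Pairwise RBF kernel exp(-||a - b||^2 / sigma2) between rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def svm_predict(X_test, X_train, y_train, alpha, b, sigma2=1.0):
    """Evaluate f(x_t) = sum_i alpha_i y_i K_it + b using only support vectors."""
    sv = alpha != 0                                    # support vectors: alpha_i != 0
    K_test = rbf_kernel(X_train[sv], X_test, sigma2)   # shape (n_sv, n_test)
    return (alpha[sv] * y_train[sv]) @ K_test + b

rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 2))
y_train = np.array([1, -1, 1, -1, 1, -1])
alpha = np.array([0.5, 0.0, 1.2, 0.0, 0.0, 0.7])       # hypothetical multipliers
X_test = rng.normal(size=(3, 2))
scores = svm_predict(X_test, X_train, y_train, alpha, b=0.1)
```

Here three of the six training points have zero multipliers, so the kernel is evaluated against three support vectors instead of six, without changing the scores.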

Least angle regression. LARS [8] is a widely used forward selection algorithm because of its 
simplicity and efficiency. Given input vectors $\mathbf{x}$, target labels $\mathbf{y}$, and the quadratic loss 
$\mathcal{L}(\beta) = (\mathbf{x}\beta - \mathbf{y})^2$, LARS learns to approximate targets by building up the coefficient vector $\beta$ in successive 
steps, starting from an all-zero vector. To minimize the loss function $\mathcal{L}$, LARS initially descends on 
a coordinate direction that has the largest gradient, 

$$\beta_t^{*} = \operatorname*{argmax}_{\beta_t} \frac{\partial \mathcal{L}}{\partial \beta_t}. \qquad (4)$$


The algorithm then incorporates this coordinate into its active set. After identifying the gradient 
direction, LARS selects the step size very carefully. Rather than taking a step that is too greedy or 
too small, LARS computes the step size at which a new direction outside of the active set attains the 
same maximum gradient as the directions in the active set. LARS then includes this new direction in the active set. 


In the following iterations, LARS descends along a direction that maintains the same gradient 
for all directions in the active set. In other words, LARS descends following an equiangular direction 
of all directions in the active set. The algorithm then repeats computing the step size, including new 
directions into the active set, and descending along an equiangular direction. This process makes 
LARS very efficient, as after $T$ iterations, the LARS solution has exactly $T$ directions in the active set, 
or equivalently, only $T$ non-zero coefficients in $\beta$. 
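
The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration of plain least angle regression under stated assumptions, not the authors' implementation: the function name `lar_path` is hypothetical, the columns of `X` are assumed to have unit norm, and the coordinate is chosen by largest absolute correlation with the residual, as in the standard LARS formulation.

```python
import numpy as np

def lar_path(X, y, n_steps):
    """Run n_steps of least angle regression and return the coefficients.

    Assumes the columns of X have unit norm (hypothetical preprocessing).
    """
    n, p = X.shape
    beta = np.zeros(p)
    mu = np.zeros(n)                                  # current fit X @ beta
    c = X.T @ (y - mu)
    active = [int(np.argmax(np.abs(c)))]              # most correlated coordinate
    for _ in range(min(n_steps, p)):
        c = X.T @ (y - mu)                            # correlations with residual
        C = np.max(np.abs(c))
        s = np.sign(c[active])
        Xa = X[:, active] * s                         # sign-aligned active columns
        g = np.linalg.solve(Xa.T @ Xa, np.ones(len(active)))
        A = 1.0 / np.sqrt(g.sum())
        w = A * g
        u = Xa @ w                                    # equiangular direction
        a = X.T @ u
        gamma, j_new = C / A, None                    # default: full least-squares step
        for j in range(p):
            if j in active:
                continue
            # smallest positive step at which coordinate j ties with the active set
            for num, den in ((C - c[j], A - a[j]), (C + c[j], A + a[j])):
                if den > 1e-12 and 0 < num / den < gamma:
                    gamma, j_new = num / den, j
        beta[active] += gamma * s * w
        mu += gamma * u
        if j_new is None:
            break
        active.append(j_new)
    return beta

X = np.eye(4)                                         # orthonormal toy design
y = np.array([3.0, -2.0, 1.0, 0.5])
beta = lar_path(X, y, n_steps=2)                      # approximately [2, -1, 0, 0]
```

After two steps the coefficient vector has exactly two non-zero entries, matching the property that $T$ iterations yield $T$ active directions.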


4 Method 

In this section, we detail the CVM approach to reduce the test-time SVM evaluation cost. We regard 
CVM as a post-processing compression of the original SVM solution. After solving an SVM, we 
obtain a set of support vectors $S = \{\mathbf{x}_i \mid \alpha_i \neq 0\}$, and the corresponding Lagrange multipliers $\alpha_i$. 
Given the original SVM solution, we can model the test-time evaluation cost explicitly. 

Kernel SVM evaluation cost. Based on the prediction function (3), we can formulate the exact 
SVM classifier evaluation cost. Let $e$ denote the cost of computing a test kernel entry (i.e. the 
kernel function of a test input $\mathbf{x}_t$ and a support vector $\mathbf{x}_i$). We assume the computation cost is 
identical across all test inputs and all support vectors. As shown in (3), generating a prediction for a 
test input requires computing the kernel entry between the test input and all support vectors. The 
total evaluation cost is a function of the number of support vectors $n_{sv}$. After obtaining the kernel 
entries for a test point $\mathbf{x}_t$, prediction is simply a linear combination of the kernel row $K_t$ weighted by 
$\alpha$. The cost of computing this linear combination is very low compared to the kernel computation, 
and therefore the total evaluation cost is $c_e = n_{sv}\, e$. We aim to reduce the size of the support vector set 
$n_{sv}$ without greatly affecting prediction accuracy. 

Removing non-support vectors. Since the test-time evaluation cost is a function of the number 
of support vectors, the goal is to cherry-pick and optimize a subset of the optimal support vectors 
bounded in size by a user-specified compression ratio. We first note that all non-support vectors 
can be removed during this process without affecting the full SVM solution. If we define a design 
matrix $\hat{K} \in \mathcal{R}^{n \times n}$ that folds the labels $\mathbf{y}$ into the kernel matrix, the squared penalty SVM 
objective function in (1) can be expressed with Lagrange parameter $\alpha$ and the kernel matrix $K$: 

$$\min_{\alpha, b}\ \max(\mathbf{1} - \hat{K}\alpha - \mathbf{y}b,\ 0)^2 + \alpha^\top K \alpha. \qquad (5)$$

Since (5) is a strongly convex function, and all non-support vectors have the corresponding Lagrange 
multiplier $\alpha_i = 0$, we can remove all non-support vectors from the optimization problem and the 
full SVM optimal solution stays the same. 

To find an optimal subset of support vectors given the compression ratio, we re-train the SVM with 
only support vectors and a constraint on the number of support vectors. Note that $\alpha$ are effectively 
the coefficients of support vectors, and we can efficiently control the number of support vectors by 
adding an $\ell_0$-norm constraint on $\alpha$. The optimization problem becomes 

$$\min_{\alpha, b}\ (\mathbf{1} - \hat{K}\alpha - \mathbf{y}b)^2 + \alpha^\top K \alpha \qquad (6)$$
$$\text{s.t. } \|\alpha\|_0 \leq \frac{1}{e} B_e,$$

where $B_e$ is the evaluation cost budget and, consequently, $\frac{1}{e} B_e$ is the desired number of support vectors 
based on the budget. Note that after removing non-support vectors, we obtain a condensed matrix 
$\hat{K} \in \mathcal{R}^{n_{sv} \times n_{sv}}$. 

Forming an ordinary least squares problem. The current form of equation (6) can be made more 
amenable to optimization by rewriting the objective function as an ordinary least squares problem. 
Expanding the squared term, simplifying, and fixing the bias term $b$ (as it does not affect the solution 
dramatically), we re-format the objective function (6) into 

$$\min_{\alpha}\ (\mathbf{1} - \mathbf{y}b)^\top(\mathbf{1} - \mathbf{y}b) - 2\alpha^\top\hat{K}^\top(\mathbf{1} - \mathbf{y}b) + \alpha^\top(\hat{K}^\top\hat{K} + K)\alpha. \qquad (7)$$
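
For readers who want the intermediate step, the expansion from the unconstrained objective to the quadratic form can be written out as below. This is a sketch that writes $\hat{K}$ for the design matrix and treats $b$ as fixed, as the text does; it adds no assumptions beyond those.

```latex
\begin{align*}
(\mathbf{1} - \hat{K}\alpha - \mathbf{y}b)^2 + \alpha^\top K \alpha
  &= \big((\mathbf{1} - \mathbf{y}b) - \hat{K}\alpha\big)^\top
     \big((\mathbf{1} - \mathbf{y}b) - \hat{K}\alpha\big) + \alpha^\top K \alpha \\
  &= (\mathbf{1} - \mathbf{y}b)^\top(\mathbf{1} - \mathbf{y}b)
     - 2\alpha^\top \hat{K}^\top (\mathbf{1} - \mathbf{y}b)
     + \alpha^\top \hat{K}^\top \hat{K} \alpha + \alpha^\top K \alpha \\
  &= (\mathbf{1} - \mathbf{y}b)^\top(\mathbf{1} - \mathbf{y}b)
     - 2\alpha^\top \hat{K}^\top (\mathbf{1} - \mathbf{y}b)
     + \alpha^\top (\hat{K}^\top \hat{K} + K) \alpha .
\end{align*}
```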

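We introduce two auxiliary variables $\Omega$ and $\beta$, where $\Omega^\top\Omega = \hat{K}^\top\hat{K} + K$ and $\Omega^\top\beta = -\hat{K}^\top(\mathbf{1} - \mathbf{y}b)$. 
Because $\hat{K}^\top\hat{K} + K$ is a symmetric matrix, we can compute its eigen-decomposition 

$$\hat{K}^\top\hat{K} + K = S D S^\top, \qquad (8)$$

where $D$ is the diagonal matrix of eigenvalues and $S$ is the orthonormal matrix of eigenvectors. 
Moreover, because the matrix $\hat{K}^\top\hat{K} + K$ is positive semi-definite, we can further decompose $S D S^\top$ 
into an inner product of two real matrices by taking the square root of $D$. Let $\Omega = \sqrt{D}\, S^\top$, and we 
obtain a matrix $\Omega$ that satisfies $\Omega^\top\Omega = \hat{K}^\top\hat{K} + K$. After computing $\Omega$, we can readily compute 
$\beta = -(\Omega^\top)^{-1}\hat{K}^\top(\mathbf{1} - \mathbf{y}b)$, where $(\Omega^\top)^{-1} = \sqrt{D^{-1}}\, S^\top$. 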

With the help of the two auxiliary variables, we convert (7), plus a constant term¹, into least squares 
format. Together with relaxation of the non-continuous $\ell_0$-norm constraint to an $\ell_1$-norm constraint, 
we obtain 

$$\min_{\alpha}\ (\Omega\alpha + \beta)^2, \quad \text{s.t. } \|\alpha\|_1 \leq \frac{1}{e} B_e. \qquad (9)$$

¹ The constant term is $(\mathbf{1} - \mathbf{y}b)^\top\big(\hat{K}(\hat{K}^\top\hat{K} + K)^{-1}\hat{K}^\top - I\big)(\mathbf{1} - \mathbf{y}b)$. 

Figure 1: Illustration of searching for a subspace $V \in \mathcal{R}^2$ that best approximates predictions $P_1$ and 
$P_2$ of training instances in $\mathcal{R}^3$ space. Neither $V_1$ nor $V_2$, spanned by existing columns in the kernel 
matrix, is a good approximation. $V^*$, spanned by kernel columns computed from two artificial 
support vectors, is the optimal solution. 
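
The construction of $\Omega$ and $\beta$ from the eigen-decomposition can be checked numerically. The sketch below is illustrative only: the random positive definite matrix standing in for $K$, the label-folded design matrix `K_hat` with entries $y_i K_{ij}$, and the fixed bias value are all assumptions, and the identity being checked holds for any choice of `K_hat`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
K = A @ A.T + 1e-3 * np.eye(n)            # stand-in positive definite kernel matrix
y = np.where(rng.normal(size=n) > 0, 1.0, -1.0)
K_hat = y[:, None] * K                    # assumed design matrix (labels folded in)
b = 0.1                                   # bias, held fixed as in the text
t = 1.0 - y * b                           # the vector (1 - y b)

M = K_hat.T @ K_hat + K                   # symmetric positive definite matrix
eigvals, S = np.linalg.eigh(M)            # M = S diag(eigvals) S^T
Omega = np.sqrt(eigvals)[:, None] * S.T   # Omega = sqrt(D) S^T
# beta = -(Omega^T)^{-1} K_hat^T (1 - y b), with (Omega^T)^{-1} = sqrt(D^{-1}) S^T
beta = -(S.T / np.sqrt(eigvals)[:, None]) @ (K_hat.T @ t)

alpha = rng.normal(size=n)                # arbitrary coefficients for the check
quadratic = t @ t - 2.0 * alpha @ (K_hat.T @ t) + alpha @ M @ alpha
least_squares = np.sum((Omega @ alpha + beta) ** 2)
constant = t @ (K_hat @ np.linalg.solve(M, K_hat.T) - np.eye(n)) @ t
```

For any `alpha`, `least_squares` equals `quadratic + constant` up to floating-point error, which is the sense in which the quadratic objective plus a constant becomes a least squares problem.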


Compressing the support vector set. The squared loss and $\ell_1$ constraint in (9) lead naturally to the 
LARS algorithm. Given a budget $B_e$, we can determine the maximum size $m$ of the compressed 
support vector set ($m = \frac{B_e}{e}$). Using LARS, we start from an empty support vector set and add $m$ 
support vectors incrementally. Since adding a support vector is equivalent to activating a coefficient 
in $\alpha$ to a non-zero value, we can obtain $m$ optimal support vectors by running the LARS optimization 
in (9) for exactly $m$ steps, where each step activates one coefficient. The resulting solution gives the 
optimal set of $m$ support vectors. We refer to this intermediate step as LARS-SVM. Note that this step 
is crucial for the problem, as this LARS-SVM solution serves as a very good initialization for the 
next step, which is a non-convex optimization problem. 

Gradient support vectors. If we interpret $\alpha$ as coordinates and the corresponding columns in the 
kernel matrix $K$ as basis vectors, then these basis vectors span an $\mathcal{R}^{n_{sv}}$ space in which lie the predictions 
of the original SVM model. In this compression algorithm, our goal is to find a lower dimensional 
subspace that supports good approximations of the original predictions. After running LARS for 
$m$ iterations, we obtain $m$ support vectors and their coefficients $\alpha$, forming an $\mathcal{R}^m$ subspace of the 
space spanned by the full kernel matrix. 

We illustrate this lower dimensional approximation in Figure 1. Vectors $P_1$ and $P_2$ are predictions of 
two training points made in the full SVM solution space ($\mathcal{R}^3$, spanned by three support vectors). 
We want to compress the model to two support vectors by looking for a subspace $V \in \mathcal{R}^2$ that 
supports the best approximations of these two predictions. Using existing support vectors as a basis, 
we can find subspaces $V_1$ and $V_2$, each spanned by a pair of support vectors. The projections of $P_1$ 
and $P_2$ on plane $V_1$ are closest to the original predictions $P_1$ and $P_2$, and thus $V_1$ is the 
better approximation. However, in this case, neither $V_1$ nor $V_2$ is a particularly good approximation. 
Suppose we remove the restriction of selecting a subspace spanned by existing basis vectors in the 
kernel matrix, instead optimizing the basis vectors to yield a more suitable subspace. In Figure 1, 
this is illustrated by the optimal subspace $V^*$, which produces a better approximation to the target 
predictions. 

Note that the basis vectors (columns of the kernel matrix) are parameterized by support vectors. By 
optimizing these underlying support vectors, we can search for a better low-dimensional subspace. 
If we denote $K_m$ as the training kernel matrix with only the $m$ columns corresponding to the support 
vectors chosen by LARS, and $\alpha_m$ as the coefficients of these support vectors, we can formulate the 
search for artificial support vectors as an optimization problem. Specifically, we minimize a squared 
loss between approximate and full SVM predictions over all support vectors, where the parameters are 
the support vectors: 

$$\min\ \mathcal{L} = (K_m\alpha_m - K\alpha)^2, \qquad (10)$$

where $K_{ij}$ is the kernel entry, and for simplicity, we use the radial basis function (RBF) kernel 
($K_{ij} = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / \sigma^2}$). However, other kernel functions are equally suitable. The unconstrained 
optimization problem (10) can be solved by conjugate gradient descent with respect to the chosen $m$ 
support vectors. Since the $\alpha$'s are the coordinates with respect to the basis, we optimize $\alpha$ jointly with 
the support vectors, which is faster than alternating between optimizing the basis and solving for the 
coordinates. The gradients can be computed very efficiently using matrix operations. Since gradient descent on support 
vectors is equivalent to moving these support vectors in a continuous space, thereby generating $m$ 
new support vectors, we refer to these newly generated support vectors as gradient support vectors. 
We denote this combined method of LARS-SVM and gradient support vectors as Compressed Vector 
Machine (CVM). Because the optimization problem in (10) is non-convex with respect to $\mathbf{x}_i$, we 
initialize our algorithm with the basis and coordinates returned in the LARS-SVM solution. 

Figure 2: Illustration of each step of CVM on a synthetic data set. (a) Simulation inputs from two 
classes (red and blue). By design, the two classes are not linearly separable. (b) Decision boundary 
formed by the full SVM solution (black curve). A small subset of support vectors picked by LARS (gray 
points) and the compressed decision boundary formed by this subset of support vectors (gray curve). 
(c-h) Optimization iterations. The gradient support vectors are moved by the iterative optimization. 
The optimized decision boundary formed by gradient support vectors (green curve) gradually 
approaches the one formed by the full SVM solution. 
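
The gradients of the squared loss in (10) with respect to the artificial support vectors and their coefficients can be written with a few matrix operations. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes the RBF kernel form `exp(-||x - z||^2 / sigma2)`, uses random data in place of a trained SVM, substitutes a simple accept-or-shrink gradient descent loop for conjugate gradient, and the names `rbf` and `loss_and_grads` are hypothetical.

```python
import numpy as np

def rbf(X, Z, sigma2):
    """RBF kernel exp(-||x - z||^2 / sigma2) between rows of X and rows of Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def loss_and_grads(X, target, Z, a, sigma2):
    """Squared loss between compressed predictions K_m a and full predictions."""
    Km = rbf(X, Z, sigma2)                 # (n, m): columns from artificial SVs
    r = Km @ a - target                    # residual of the compressed model
    loss = r @ r
    grad_a = 2.0 * Km.T @ r                # gradient w.r.t. the coefficients
    W = (r[:, None] * Km) * a[None, :]     # W_ij = r_i K_ij a_j
    grad_Z = (4.0 / sigma2) * (W.T @ X - W.sum(axis=0)[:, None] * Z)
    return loss, grad_a, grad_Z

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))               # stand-in for the original support vectors
alpha_full = rng.normal(size=40)           # stand-in for the full-model coefficients
sigma2 = 2.0
target = rbf(X, X, sigma2) @ alpha_full    # full-model predictions K alpha

Z, a = X[:4].copy(), alpha_full[:4].copy() # stand-in for a LARS-SVM initialization
loss0 = loss_and_grads(X, target, Z, a, sigma2)[0]
step = 1e-2
for _ in range(100):
    loss, g_a, g_Z = loss_and_grads(X, target, Z, a, sigma2)
    a_try, Z_try = a - step * g_a, Z - step * g_Z
    if loss_and_grads(X, target, Z_try, a_try, sigma2)[0] < loss:
        a, Z = a_try, Z_try                # accept the step: the vectors move
    else:
        step *= 0.5                        # otherwise shrink the step size
final_loss = loss_and_grads(X, target, Z, a, sigma2)[0]
```

Because a step is only accepted when it lowers the loss, `final_loss` never exceeds `loss0`. A conjugate gradient solver, as used in the text, would consume the same `loss_and_grads` output.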

In practice, it may be desirable to optimize both the SVM cost parameter $C$ and any kernel parameters 
(e.g. $\sigma^2$ in the RBF kernel) for the final CVM model. Additionally, it may be preferable to 
optimize CVM constrained by the validation accuracy of the compressed model rather than the size 
of the support vector budget. Constrained Bayesian optimization [?] supports efficient constrained 
joint hyperparameter optimizations of this type. Additionally, the L1-penalized support vector 
selection in the LARS-SVM step may benefit from recent work on highly parallel Elastic Net solvers 
[?]. 

5 Results 

In this section, we first demonstrate Compressed Vector Machine (CVM) on a synthetic data set to 
graphically illustrate each step in the algorithm. We then evaluate CVM on several medium-scale 
real-world data sets. 

Synthetic data set. The data set contains 600 sample inputs from two classes (red and blue), where 
each input contains two features. The blue inputs are sampled from a Gaussian distribution with 
mean at the origin and variance 1, and red inputs are sampled from a noisy circle surrounding the 
blue inputs. As shown in Figure 2(a), by design the data set is not linearly separable. For simplicity, 
we treat all inputs as training inputs. To evaluate CVM, we first learn an SVM with the RBF kernel 
from the full training set. We plot the resulting optimal decision boundary in Figure 2(b) with a 
black curve. In total, the full model has 80 support vectors. 
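
A data set of this shape can be generated as in the sketch below. The text specifies only the total of 600 inputs, the unit-variance Gaussian for the blue class, and a noisy circle for the red class; the equal 300/300 split, the circle radius of 4, the radial noise level of 0.3, and the function name `make_ring_data` are assumptions chosen for illustration.

```python
import numpy as np

def make_ring_data(n_per_class=300, radius=4.0, noise=0.3, seed=0):
    """Two-class toy data: a Gaussian blob surrounded by a noisy circle."""
    rng = np.random.default_rng(seed)
    # Blue class: Gaussian with mean at the origin and variance 1.
    blue = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 2))
    # Red class: points on a circle with Gaussian noise on the radius (assumed).
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_class)
    radii = radius + noise * rng.normal(size=n_per_class)
    red = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    X = np.vstack([blue, red])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y

X, y = make_ring_data()
```

With these settings no straight line separates the two classes, which is the property the experiment relies on.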

To compress the model, we first select a subset of support vectors by solving the LARS-SVM 
optimization (9). Specifically, we compress the model to 10% of its original size, 8 support vectors, by 
running LARS for 8 iterations. The 8 LARS-SVM support vectors are shown in Figure 2(b) as solid 
gray points, and the approximate LARS-SVM decision boundary is shown by the gray curve. 

Since the subspace formed by 8 support vectors is heavily restricted by the discrete training input 
space, the approximation is poor. To overcome this problem, we search for a better subspace or basis 
in a continuous space, and perform gradient descent on support vectors by optimizing (10). In Figure 
2(c-h), we illustrate the optimization with updated support vector locations and optimized decision 
boundaries as we gradually increase the number of iterations. The resulting gradient support vectors 
are shown as gray points and the new optimized decision boundaries formed from these new gradient 
support vectors are shown by green curves. After 2560 iterations, as shown in Figure 2(h), we can 
observe that the optimized decision boundary (green) is very close to the boundary captured in the 
full model (black). These optimized decision boundaries demonstrate that moving a small subset 
of support vectors in a continuous space can efficiently approximate the optimal decision boundary 
formed by the full SVM solution, supporting effective SVM model compression. 


Statistics         Pageblocks   Magic   Letters   20news   MNIST   DMOZ
#training exam.    4379         15216   16000     11269    60000   7184
#testing exam.     1094         3804    4000      7505     10000   1796
#features          10           10      16        200      784     16498
#classes           2            2       26        20       10      16

Table 1: Statistics of all six data sets. 


Large real-world data sets. To evaluate the performance of CVM on real-world applications, we 
evaluate our algorithm on six data sets of varying size, dimensionality and complexity. Table 1 
details the statistics of all six data sets. We use LibSVM [4] to train a regular RBF kernel SVM 
using regularization parameter $C$ and RBF kernel width $\sigma$ selected on a 20% validation split. For 
multi-class data sets, we use the one-vs-one multi-class scheme. The classification accuracy of test 
predictions from this SVM model serves as a baseline in Figure 3 (full SVM). 

Given the full SVM solution, we run CVM in two steps. First, we use LARS to solve the optimization 
problem in (9) using all support vectors from the original SVM model. An initial compressed 
support vector set is selected with a target compressed size (e.g. 10 out of 500 support vectors). 
The selected support vectors serve as the second baseline in Figure 3 (LARS-SVM). Second, we shift 
these support vectors in a continuous space by optimizing (10) w.r.t. the input support vectors and the 
corresponding Lagrange multipliers $\alpha$, generating gradient support vectors. This final set of gradient 
support vectors constitutes the CVM model. To show the trend of accuracy/cost performance, we 
plot the classification accuracy for CVM after adding every 10 support vectors. Figure 3 shows the 
performance of CVM and the baselines on all six data sets. 

Comparison with prior work. Figure 3 also shows a comparison of CVM with Reduced-SVM [13]. 
This algorithm takes an iterative two-phase approach. First, a set of support vectors is heuristically 
selected from random samples of the training set and added to the existing set of support vectors 
(initially empty). Then, the model weights are optimized by an SVM with the quadratic hinge loss. 
The algorithm alternates these two steps until the target number of support vectors is reached. 

As shown in Figure 3, CVM significantly improves over all baselines. Compared to the current 
state of the art, Reduced-SVM, CVM has the capability of moving support vectors, generating a 
new basis, and learning a basis that closely approximates the decision boundaries formed by 
the full SVM solution. It is this ability that distinguishes CVM from other algorithms when the 
evaluation budget is low. Across all data sets, CVM maintains close to the same accuracy as the full 
SVM with merely 10% of the support vectors. 

Figure 3: Accuracy versus number of support vectors (in log scale). 